THE INDEX OF INCONSISTENCY FOR AN L-FOLD CLASSIFICATION SYSTEM, L > 2

Max  A.  Bershad 


We have been employing an Index of Inconsistency,¹ which we now call I, that is useful for a two-fold classification system; that is, for a class and its complement. This memorandum extends the index to classification systems with more than two classes, using the same rationale that led to I, and makes I a special case of the more general index.

I has previously been defined as the proportion of the total variance of a variable of interest accounted for by the response variance. More specifically, it was defined as the ratio

$$I = E(X_{jt} - X_{j\cdot})^2 \div E(X_{jt} - X_{\cdot\cdot})^2 \qquad (1)$$

where $X_{jt}$ was the value given to the j-th element of the population on the t-th trial. The other symbols have their usual meaning and are in the usual model.² When $X_{jt}$ was a 0, 1 variable, I was defined as

$$I = P_{1,0} \div PQ, \qquad (2)$$

or

$$I = \tfrac{1}{2}(P_{1,0} + P_{0,1}) \div PQ,$$

where

$$P_{1,0} = P_{0,1} = E\,X_{jt}(1 - X_{jt'}); \qquad P = E\,X_{jt} \quad \text{and} \quad Q = 1 - P.$$


¹ A detailed description of the development of the index of inconsistency is given in reference [2].

² Ibid.


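A more general index, $I_L$, is defined as the proportion of the expected squared distance of the point $\mathbf{X}_{jt}$ from the point $\mathbf{X}_{\cdot\cdot}$ (i.e. total variance) accounted for by the expected squared distance of the point $\mathbf{X}_{jt}$ from the point $\mathbf{X}_{j\cdot}$ (i.e. response variance). All the points mentioned are L-dimensional. In symbols

$$I_L = \frac{E\,|\mathbf{X}_{jt} - \mathbf{X}_{j\cdot}|^2}{E\,|\mathbf{X}_{jt} - \mathbf{X}_{\cdot\cdot}|^2} \qquad (3)$$

where

$$\mathbf{X}_{jt} = (X_{1jt}, X_{2jt}, X_{3jt}, \ldots, X_{kjt}, \ldots, X_{Ljt}),$$

$$\mathbf{X}_{j\cdot} = (X_{1j\cdot}, X_{2j\cdot}, X_{3j\cdot}, \ldots, X_{kj\cdot}, \ldots, X_{Lj\cdot}),$$

$$\mathbf{X}_{\cdot\cdot} = (X_{1\cdot\cdot}, X_{2\cdot\cdot}, X_{3\cdot\cdot}, \ldots, X_{k\cdot\cdot}, \ldots, X_{L\cdot\cdot}).$$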

In this symbolism $X_{kjt}$ represents the value for the k-th classification assigned to the j-th individual on the t-th trial. Typically, in $\mathbf{X}_{jt}$ all the $X_{kjt}$'s may have the value zero except for one, but this typical form is not a necessity. Typically, the k subscript for which the X is not zero shows the code into which the element has been classified. Thus $\mathbf{X}_{jt} = (0, 0, 7, 0, 0)$ indicates a five-fold classification system with j being coded in the third class on trial t.


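From its definition, $I_L$ can also be written as

$$I_L = \frac{E \sum_k (X_{kjt} - X_{kj\cdot})^2}{E \sum_k (X_{kjt} - X_{k\cdot\cdot})^2}. \qquad (4)$$

If we now represent $E(X_{kjt} - X_{kj\cdot})^2$ as $\sigma_R^2(k)$, $E(X_{kjt} - X_{k\cdot\cdot})^2$ as $\sigma^2(k)$, and $\sigma_R^2(k) \div \sigma^2(k)$ as $I(k)$, we have

$$I_L = \frac{\sum_k I(k)\,\sigma^2(k)}{\sum_k \sigma^2(k)},$$

so that $I_L$ is a weighted average of the I's one would have obtained by making a separate computation for each classification.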

When dealing with a 0,1 variable (that is, $\mathbf{X}_{jt}$ is of the form (0, 0, 0, 0, 1, 0, ..., 0), where all elements are 0 except one which is 1) and letting

$$P_{kk'} = E(X_{kjt} X_{k'jt'})$$

and

$$P_{kk} = E(X_{kjt} X_{kjt'}), \qquad P_{\cdot k} = P_{k\cdot} = P_k = E\,X_{kjt},$$

we have

$$I_L = \frac{\sum_k (P_k - P_{kk})}{\sum_k P_k Q_k} = \frac{1 - \sum_k P_{kk}}{1 - \sum_k P_k^2}. \qquad (5)$$


The numerator is the proportion off the diagonal in the L x L frequency table representing the tabulation after all possible pairs of trials on the same individuals. Consequently, if $I_L$ is to be estimated from an experiment, a three column table which shows $p_{k\cdot}$, $p_{\cdot k}$, and $p_{kk}$ for each classification is all that is required. For the case preceding, it will also be noted that for L = 2, $I_2$ is the same as the Index of Inconsistency, I, already defined.
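As an illustration of the estimate described above, the following minimal Python sketch computes a sample value of the index from an L x L table of counts. The function name and the data are hypothetical. Since two trials give two sets of marginal proportions, the sketch assumes that the squared proportion in the denominator of equation (5) is replaced by the product of the first-trial and second-trial proportions; that substitution is an assumption of this sketch, not something the memorandum specifies.

```python
def index_of_inconsistency(table):
    """Estimate the L-fold index from an L x L table of counts.

    Rows are the classification on the first trial, columns the
    classification on the second trial.
    """
    size = len(table)
    n = sum(sum(row) for row in table)
    p_row = [sum(table[k]) / n for k in range(size)]                       # p_k.
    p_col = [sum(table[i][k] for i in range(size)) / n for k in range(size)]  # p_.k
    p_diag = [table[k][k] / n for k in range(size)]                        # p_kk

    # Numerator of equation (5): proportion off the diagonal.
    numerator = 1.0 - sum(p_diag)
    # Assumption: P_k squared is estimated by p_k. times p_.k.
    denominator = 1.0 - sum(r * c for r, c in zip(p_row, p_col))
    return numerator / denominator


# Hypothetical two-class table: 80 percent of cases fall on the diagonal.
counts = [[40, 10], [10, 40]]
print(index_of_inconsistency(counts))
```

For this hypothetical table the off-diagonal proportion is .2 and the denominator is .5, so the index is about .4; a table with every case on the diagonal gives an index of zero.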

Another way of writing equation (5) for $I_L$ is

$$I_L = 1 - \frac{\sum_k P_{kk} - \sum_k P_k^2}{\sum_k P_k Q_k}$$

where the numerator of the fraction can be thought of as the excess of the number on the diagonal over what the diagonal would have been if both trials were completely independent, that is, if the independence also extended over the elements of the population instead of being confined to independence over trials for the same individuals.
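The equivalence of the two ways of writing the index, and its interpretation as a weighted average of the separate two-fold indexes, can be checked numerically. The sketch below is only an illustration; the function name and the proportions are hypothetical, and each class is assumed to have a proportion strictly between 0 and 1.

```python
def inconsistency_forms(P_k, P_kk):
    """Return the index computed three ways from class proportions.

    P_k  : proportion of the population in class k on a single trial
    P_kk : proportion classified in class k on both of two trials
    """
    Q_k = [1.0 - p for p in P_k]
    total = sum(p * q for p, q in zip(P_k, Q_k))        # sum of P_k Q_k

    # Ratio form: proportion off the diagonal over total variance.
    ratio_form = (1.0 - sum(P_kk)) / total

    # "Excess over independence" form.
    excess = sum(P_kk) - sum(p * p for p in P_k)
    excess_form = 1.0 - excess / total

    # Weighted average of the per-class two-fold indexes, weights P_k Q_k.
    per_class = [(p - d) / (p * q) for p, d, q in zip(P_k, P_kk, Q_k)]
    weighted = sum(i * p * q for i, p, q in zip(per_class, P_k, Q_k)) / total
    return ratio_form, excess_form, weighted


# Hypothetical three-class example.
print(inconsistency_forms([0.5, 0.3, 0.2], [0.4, 0.2, 0.1]))
```

All three values agree (about .484 for these hypothetical proportions), which is the algebraic identity stated in the text.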

The index of inconsistency can also be thought of in terms other than as the ratio of two variances. For example, just as I could have been defined as the complement of the correlation coefficient, that is

$$1 - \rho_{X_{jt},\,X_{jt'}},$$

so also $I_L$ can be defined in terms of correlation coefficients. We will not try to examine here other approaches to the index or its relation to other measures of association.

One would expect $I_L$ to increase when L increases and distinctions become more difficult to make. It will be noticed that the numerator of the index involves only the number on the diagonal and otherwise does not involve any concept of dispersion from the diagonal. An illustrative computation of the indexes from the frequencies for 1950 Post Enumeration Survey (PES) income data³ results in the following table and bears out the expectation. (Note: (i) the PES and Census have been considered the first and second trials of procedure, (ii) income classes are rankable).

Number of income groups (L)     Index of inconsistency ($I_L$)
          14                              .422
           8                              .320
           4                              .220
           2

³ These frequencies are given on page 42 of reference [5].
A COMPARISON OF VARIANCE AND THE INFORMATION MEASURE, H, AS PARAMETERS OF A CATEGORICAL DISTRIBUTION

Max  A.  Bershad 


PURPOSE 

The  purpose  of  this  memorandum  is  to 
compare  some  of  the  characteristics  of  a 
variance  measure  and  of  the  information  or 
entropy  measure,  and  to  show  that  the  latter 
measure,  like  a  variance  measure,  may  be  viewed 
as  a  population  parameter.  The  information 
measure is then used to calculate the index of inconsistency¹ in a response error problem just as variances were used previously. Dr. Nathan Keyfitz of the University of Chicago suggested the application of the information measure to response error measurement in a letter to the Bureau.

NOTATION  AND  DEFINITIONS 

Consider a population of N elements j, j = 1, 2, 3, ..., N. Each element is classified as belonging to one of the L categories of a classification system based on the attribute A.

By the variance of the categorical distribution, denoted by V(A), we mean

$$V(A) = E\,|\mathbf{X}_j - \mathbf{X}|^2 \qquad (1a)$$

where

$$\mathbf{X}_j = (X_{j1}, X_{j2}, \ldots, X_{jL}).$$

Every $X_{jk}$ except one in the point $\mathbf{X}_j$ is 0. The exception has the value 1 for the category in which j is classified. Thus

$$\mathbf{X}_j = (0, 1, 0, 0)$$

indicates that the j-th unit is classified in the second of four categories. If, in the population, there are $P_k N$ units in the k-th category, equation (1a) is equivalent to

$$V(A) = \sum_{k=1}^{L} P_k Q_k \qquad (1b)$$

$$= 1 - \sum_k P_k^2 \qquad (1c)$$

$$= \left(1 - \frac{1}{L}\right) - L\,\sigma_P^2 \qquad (1d)$$

where

$$\sigma_P^2 = \frac{1}{L} \sum_k (P_k - \bar{P})^2$$

and

$$\bar{P} = \frac{1}{L} \sum_k P_k.$$

The entropy or information measure is usually denoted by H(A) with

$$H(A) = -\sum_{k=1}^{L} P_k \log P_k. \qquad (2)$$

The base of the logarithm is frequently taken as 2, but another base may be used.
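The two measures can be computed directly from the category proportions. The following sketch is illustrative only: the function names and the proportions are hypothetical. It evaluates V(A) as one minus the sum of squared proportions, checks that this agrees with the form that uses the variance of the proportions about their mean 1/L, and evaluates H(A) in a chosen logarithmic base.

```python
import math


def variance_measure(P):
    """V(A) as one minus the sum of squared category proportions."""
    return 1.0 - sum(p * p for p in P)


def variance_measure_alt(P):
    """V(A) as (1 - 1/L) minus L times the variance of the proportions."""
    L = len(P)
    mean = sum(P) / L
    sigma2_P = sum((p - mean) ** 2 for p in P) / L
    return (1.0 - 1.0 / L) - L * sigma2_P


def entropy_measure(P, base=2):
    """H(A) = -sum P_k log P_k; empty categories contribute nothing."""
    return -sum(p * math.log(p, base) for p in P if p > 0)


P = [0.5, 0.25, 0.25]             # hypothetical three-category population
print(variance_measure(P))        # 0.625
print(variance_measure_alt(P))    # agrees with the line above up to rounding
print(entropy_measure(P))         # about 1.5 in base 2
```

With every category equally likely, the second form makes the maximum 1 - 1/L visible at a glance, because the variance of the proportions is then zero.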

COMPARISONS  OF  THE  TWO  MEASURES 

(1) Comparison as loss functions. The variance may be viewed as the average loss resulting from a process in which (a) an element is picked at random from the population; (b) its classification is observed, e.g. (0, 1, 0, 0); (c) the statement is made that this observation is the population mean; (d) a loss results that is equal to the squared distance of the observation from the true population mean, e.g. the squared distance between (0, 1, 0, 0) and $(P_1, P_2, P_3, P_4)$; and (e) the process is repeated.

¹ The index of inconsistency is defined in the first paragraphs of the preceding note, "The Index of Inconsistency for an L-Fold Classification System, L > 2." For a detailed description see reference [2].

Since this average loss is equal to

$$\sum_k P_k (1 - P_k),$$

one has an alternate way of expressing the loss, namely, that if the selected element belongs to the k-th category the loss is $Q_k$.

Whereas V(A) employs the loss function $Q_k$, H(A) employs the loss function

$$(\log 1 - \log P_k).$$

It is interesting to note that $Q_k$ is the leading term in the expansion

$$\log 1 - \log P_k = Q_k + \frac{Q_k^2}{2} + \frac{Q_k^3}{3} + \ldots$$
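The expansion can be verified numerically. The short sketch below assumes natural logarithms (the series in powers of Q holds in that base; another base only rescales it); the function name and the value .8 are hypothetical.

```python
import math


def log_loss_series(P_k, terms):
    """Partial sum of Q + Q^2/2 + Q^3/3 + ... with Q = 1 - P_k."""
    Q = 1.0 - P_k
    return sum(Q ** n / n for n in range(1, terms + 1))


P_k = 0.8                              # hypothetical category proportion
print(log_loss_series(P_k, 1))         # leading term, the variance loss Q_k
print(log_loss_series(P_k, 50))        # close to the information loss
print(-math.log(P_k))                  # log 1 - log P_k in natural logarithms
```

For a category holding most of the population the two losses are close; they diverge as the proportion falls, because the higher powers of Q then matter.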

(2) Comparison for largest and smallest values. The largest value of V(A), as seen from equation (1d), occurs when

$$\sigma_P^2 = 0;$$

that is, all categories have the same proportion, $1/L$, of the population. The largest possible value is thus $1 - 1/L$. The largest value of H(A) occurs in the same circumstance and is log L. It is interesting that both maxima, i.e.

$$1 - \frac{1}{L} \quad \text{and} \quad \log L,$$

increase with increasing L. However, the former cannot exceed unity but the latter can increase indefinitely for a fixed logarithmic base.


The smallest value of V(A) occurs, as seen from equation (1c), when the entire population belongs to one category, in which event V(A) = 0. H(A) also is least for such a population and also has the value 0. Thus

$$0 \le V(A) \le 1 - \frac{1}{L} < 1 \qquad (3)$$

$$0 \le H(A) \le \log L. \qquad (4)$$


(3) Comparison for increasing number of classes. It is clear that the addition to the classification scheme of a category to which no element belongs leaves both V(A) and H(A) unchanged.

Splitting a category, say k, into two new categories, say k' and k'', increases V(A). The amount of increase is

$$-(P_{k'}^2 + P_{k''}^2) + (P_{k'} + P_{k''})^2 = 2 P_{k'} P_{k''}.$$

At the same time H(A) also increases. This follows from the fact that

$$[-pP_k \log(pP_k) - qP_k \log(qP_k)] - [-P_k \log P_k] = P_k[-p \log p - q \log q]$$

where

$$P_{k'} = pP_k \quad \text{and} \quad P_{k''} = qP_k.$$

The difference is $P_k$ times the entropy of the classification system within category k.
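Both increases can be confirmed with a small numerical sketch. The function names and the proportions are hypothetical, and natural logarithms are used for the entropy (any base gives the same identity).

```python
import math


def V(P):
    """Variance measure: one minus the sum of squared proportions."""
    return 1.0 - sum(x * x for x in P)


def H(P):
    """Entropy measure in natural logarithms."""
    return -sum(x * math.log(x) for x in P if x > 0)


def split_category(P, k, p):
    """Split category k into two parts with shares p and q = 1 - p."""
    return P[:k] + [p * P[k], (1.0 - p) * P[k]] + P[k + 1:]


P = [0.5, 0.3, 0.2]                      # hypothetical proportions
P_split = split_category(P, 0, 0.6)      # 0.5 becomes 0.3 and 0.2

gain_V = V(P_split) - V(P)               # equals 2 * 0.3 * 0.2
gain_H = H(P_split) - H(P)               # equals 0.5 * H([0.6, 0.4])
print(gain_V, 2 * 0.3 * 0.2)
print(gain_H, 0.5 * H([0.6, 0.4]))
```

The variance gain depends on the product of the two new proportions, while the entropy gain is the parent proportion times the entropy of the split within it.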

(4) Comparison for independence. If the classification system, A, is the intersection (product) of two other classification systems B and C, then

$$H(A) = H(BC) = -\sum_i \sum_j P_{ij} \log P_{ij}$$

$$= -\sum_i P_{i\cdot} \log P_{i\cdot} - \sum_i P_{i\cdot} \sum_j (P_{ij}/P_{i\cdot}) \log(P_{ij}/P_{i\cdot})$$

$$H(BC) = H(B) + \sum_i P_{i\cdot}\,H_{B_i}(C) = H(B) + H(C|B) \qquad (5)$$

where the algebra suggests the meaning of the symbols without further definition. If B and C are independent, i.e.,

$$P_{ij} = P_{i\cdot} P_{\cdot j},$$

then

$$H(BC) = H(B) + H(C).$$

Of particular interest is the case where B is the classification system of a first enumeration and C is the same as B but is used on a second enumeration. C will be written as B'. Then

$$H(BB') = H(B) + H(B'|B)$$

which is equal to 2H(B') if the two are independent and

$$H(B') = H(B);$$

and which is equal to H(B) if completely dependent.

V(A) on the other hand will not be additive under independence. In place of equation (5) we have

$$V(A) = V(BC) = 1 - \sum_i \sum_j P_{ij}^2 = 1 - \sum_i P_{i\cdot}^2 \sum_j \left(\frac{P_{ij}}{P_{i\cdot}}\right)^2$$

$$= 1 - \sum_i P_{i\cdot}^2 \left(1 - V_{B_i}(C)\right)$$

$$= V(B) + \sum_i P_{i\cdot}^2\,V_{B_i}(C). \qquad (6)$$

The conditional measure is weighted by the square of $P_{i\cdot}$ and not by $P_{i\cdot}$ as in equation (5). Equation (6) under independence leads to the relationship

$$[1 - V(BC)] = [1 - V(B)][1 - V(C)];$$

that is, the complements of the variances are multiplicative under independence.
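The contrast between the two measures under independence can be seen in a small numerical sketch. The function names and the marginal proportions are hypothetical; the joint distribution is built as the product of the marginals, which is exactly the independence assumption of the text.

```python
import math


def V(P):
    """Variance measure: one minus the sum of squared proportions."""
    return 1.0 - sum(x * x for x in P)


def H(P):
    """Entropy measure in natural logarithms."""
    return -sum(x * math.log(x) for x in P if x > 0)


B = [0.5, 0.5]                       # hypothetical marginal for system B
C = [0.2, 0.3, 0.5]                  # hypothetical marginal for system C
BC = [b * c for b in B for c in C]   # joint proportions under independence

print(H(BC), H(B) + H(C))                      # entropy is additive
print(1 - V(BC), (1 - V(B)) * (1 - V(C)))      # complements multiply
print(V(BC), V(B) + V(C))                      # variances do not add
```

For these proportions V(BC) is .81 while V(B) + V(C) is 1.12, which exceeds the upper bound of unity and so could not be a variance measure at all.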

(5) Comparison for pairs of population elements. For variances

$$\sigma^2 = E_1 |\mathbf{X}_j - \mathbf{X}|^2 = \tfrac{1}{2} E_2 |\mathbf{X}_j - \mathbf{X}_{j'}|^2$$

where $E_1$ is the expected value of all selections of one element from the population, and $E_2$ is regarded as the expected value over all pairs of two elements (selected with replacement, i.e., independently). For the special case of a categorical distribution we will now write

$$V(A) = V(A'|A) \qquad (7)$$

with A being the classification system for an element and A, A' denoting the same classification system as it is needed to describe the pair of elements j, j'.

And analogously for the information measure

$$H(A) = H(A'|A) \qquad (8)$$

where A is the classification system for an element, and A, A' denotes the same classification system as it is needed to describe the pair of elements.

The same considerations hold for the uncorrelated part of response variances, namely

$$\sigma_R^2 = E_1 |\mathbf{X}_{jt} - \mathbf{X}_{j\cdot}|^2 = \tfrac{1}{2} E_2 |\mathbf{X}_{jt} - \mathbf{X}_{jt'}|^2$$

where $E_1$ is the expected value over all selections of one trial for a given j, which is then averaged over all j; and $E_2$ is the expected value of all pairs of two independent trials for a given j which is then averaged over all j. For the special case of a categorical distribution we will now write

$$V_R(A'|A)$$

with A being the classification system for jt and A' for jt' where the pair jt, jt' constitutes an element of a new population.

The analogue for the information measure is $H_R(A'|A)$ where A is the classification system used for jt and A' is the same classification system applied to jt'. Again the subscript, R, reminds us that j is constant on both trials.

To evaluate the uncorrelated response variance, or

$$V_R(A'|A),$$

and also

$$H_R(A'|A),$$

the data we usually have relate to a large sample of the elements j but only a sample of one pair of trials for each element. The data are tabulated as frequencies in a square table whose vertical stub is the classification (A) on the first trial and whose horizontal stub is the classification (A') on the second trial. The frequencies are converted to sample proportions which are used as estimates of population proportions even though p log p is a biased estimate of P log P by a term of order $n^{-1}$.

THE INDEX OF INCONSISTENCY

The maximum value of

$$H_R(A'|A) = H_R(A'A) - H_R(A)$$

is H(A) for independence and is zero for complete dependence. This is the measure suggested by Dr. Keyfitz which he calls Shannon's "measure of equivocation."

Comparable to the index of inconsistency, the ratio of $\sigma_R^2$ to its maximum possible value, namely $\sigma^2$ (see preceding memorandum, "The Index of Inconsistency for an L-Fold Classification System, L > 2") we would have

$$H_R(A'|A) \div H(A'|A) = H_R(A'|A) \div H(A). \qquad (9)$$

Dr. Solomon Kullback of The George Washington University has a measure

$$I(1:2;\,AA') = H_R(A) + H_R(A') - H_R(AA')$$

$$= H_R(A) - H_R(A'|A),$$

which is defined as the mean under $H_1$ of the discrimination information in an observation for the hypothesis $H_1$ against $H_2$ where $H_1$ is the hypothesis that implies the actual dependence of A and A' while $H_2$ implies their statistical independence. This is the measure that might be used for an index of consistency after division by H(A). Such a measure would be the complement (would sum to unity) of the Index of Inconsistency.
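A minimal sketch of how the information-based index and its complement might be computed from a square table of counts follows. The function names and the data are hypothetical. The sketch assumes that H(A) is estimated from the first-trial marginal proportions and uses logarithms to the base ten; the two indexes sum to unity only when both marginals have the same entropy, as they do in a symmetric table, and the first-trial marginal must not be concentrated in a single class.

```python
import math


def entropy(props):
    """Entropy in base-ten logarithms; zero cells contribute nothing."""
    return -sum(p * math.log10(p) for p in props if p > 0)


def information_indexes(table):
    """Return (index of inconsistency, index of consistency).

    Rows are the first-trial classification A, columns the second-trial
    classification A'.
    """
    n = sum(sum(row) for row in table)
    joint = [cell / n for row in table for cell in row]
    first = [sum(row) / n for row in table]
    second = [sum(col) / n for col in zip(*table)]

    H_joint = entropy(joint)      # H_R(AA')
    H_first = entropy(first)      # estimate of H(A), an assumption of this sketch
    H_second = entropy(second)    # H_R(A')

    inconsistency = (H_joint - H_first) / H_first
    consistency = (H_first + H_second - H_joint) / H_first
    return inconsistency, consistency


counts = [[40, 10], [10, 40]]     # hypothetical symmetric table
print(information_indexes(counts))
```

A table with independent trials gives an index of inconsistency of one, and a table with every case on the diagonal gives zero.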


AN APPLICATION

To observe the nature of the variances and of the information measures considered, computations have been made for 1950 Post-Enumeration Survey (PES) income data.²

                                                    The number of income groups (L)
Function                                               2          4          8
$\sigma_R^2 = \sum_k (P_k - P_{kk}) = V_R(A'|A)$      .0656      .148       .263
$\sigma^2 = \sum_k P_k Q_k = V(A'|A)$                 .309       .647       .822
Index of Inconsistency                                .212       .229       .320
$H_R(AA')$                                            .3097      .7243     1.1910
$H(A)$                                                .2120      .4982      .7965
$H_R(A'|A) = H_R(AA') - H(A)$                         .0977      .2261      .3945
$H_R(A'|A) \div H(A'|A)$                              .461       .454       .495
$H_R(A'|A)$ changed to base L                         .3246      .3755      .4368
$\sigma^2 \div (1 - 1/L)$                             .62        .86        .94
$H(A) \div \log L$                                    .70

Logarithms have been taken to the base ten unless otherwise indicated.
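The derived rows of the table follow arithmetically from the four primary rows. As a check, the sketch below recomputes them for the first column of figures (two income groups), taking .0656, .309, .3097, and .2120 as given; the function name and the dictionary keys are hypothetical labels, not notation from the memorandum.

```python
import math


def derived_rows(L, sigma_R2, sigma2, H_joint, H_A):
    """Recompute the derived table rows from the primary ones (base-ten logs)."""
    H_cond = H_joint - H_A                       # H_R(A'|A)
    return {
        "index_variance": sigma_R2 / sigma2,     # index of inconsistency
        "H_cond": H_cond,
        "index_information": H_cond / H_A,       # H_R(A'|A) / H(A'|A)
        "H_cond_base_L": H_cond / math.log10(L), # change of logarithmic base
        "sigma2_over_max": sigma2 / (1 - 1 / L),
        "H_over_max": H_A / math.log10(L),
    }


rows = derived_rows(L=2, sigma_R2=0.0656, sigma2=0.309,
                    H_joint=0.3097, H_A=0.2120)
for name, value in rows.items():
    print(f"{name}: {value:.4f}")
```

The printed values round to .212, .0977, .461, .3246, .62, and .70, in agreement with the tabulated column.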

These indexes are slightly revised from similar figures given in the preceding memorandum "The Index of Inconsistency for an L-fold Classification System, L > 2."

Of great interest is the comparison of the two methods of computing the index of inconsistency. $\sigma_R^2/\sigma^2$ (or $V_R(A'|A) \div V(A'|A)$) results in an increasing index with increasing L; $H_R(A'|A) \div H(A'|A)$, on the other hand, does not seem to be affected that way. How would an ideal index behave? Should it increase with L; should it have the effect of L removed; or should we have more than one index in order to answer both questions?

² See page 42 of reference [3].
CONCLUSION


Since the information measures are confined to categorical distributions, they do not seem to have the generality of the variance measures. Consequently they do not cope with the problem in classification of giving greater importance to the classification of elements having larger and more important quantities. Furthermore, unlike the variance measures, they cannot accommodate the classification of an element in more than one category by means of percentages which add to one. In addition they are less familiar to us than variances. On the other hand, variances do not have the additive quality that the information measure has for independent classifications.

It would seem desirable to obtain both types of measures in our response error work and thus to obtain greater familiarity with the quantities and characteristics of the information measures.


VARIANCE OF AN ESTIMATE OF ENUMERATOR VARIABILITY, USING A REPLICATED SAMPLE

Margaret  Gurney 


Summary 

Using a model for enumerator variance, an approximation to the variance of the estimate of enumerator variance in a Census is obtained. (See Eq. 12.)

It is found that an interpenetrating sample of two enumerators in each of 100 assignments will produce, for a (0,1) variate which is 10 percent or more of the population, an estimate of enumerator variability with a reliability of 40 percent or better, on the average.

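A. The Model¹

[Figure: two-way array of the observations $X_{ce}$, with the enumerators $c_1, c_2, \ldots, c_K$ across the top and the elements $e_1, e_2, \ldots, e_N$ down the side; the margins show the means $X_{\cdot e}$ and $X_{c\cdot}$ and the overall mean $X_{\cdot\cdot}$.]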

A population consisting of N elements $(e_1, e_2, \ldots, e_N)$, which may be persons, farms, households, EDs, etc., is enumerated by each of K enumerators $(c_1, c_2, \ldots, c_K)$. A statistic $X_{ce}$ is observed for each combination (c,e), and the average $X_{c\cdot}$ (of every element of the population as enumerated by enumerator c) is obtained. The variance of $X_{c\cdot}$ around $X_{\cdot\cdot}$ (the mean for all enumerators) is defined as the Enumerator Variance, $\sigma_{X_{c\cdot}}^2$, in the population.

¹ For a description of the development of this model see reference [1].

The average for a single person (e), as enumerated by all enumerators, is called $X_{\cdot e}$ and $E(X_{\cdot e} - X_{\cdot\cdot})^2$ is defined as the Sampling Variance $\sigma_{X_{\cdot e}}^2$ in the population.


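The total variance for the population can be broken into several pieces:

$$\sigma_X^2 = E(X_{ce} - X_{\cdot\cdot})^2 = \sigma_I^2 + \sigma_{X_{\cdot e}}^2 + \sigma_{X_{c\cdot}}^2 \qquad (1)$$

where unit variances are as follows:

$\sigma_I^2 = E(X_{ce} - X_{\cdot e} - X_{c\cdot} + X_{\cdot\cdot})^2$ is called "interaction"

$\sigma_{X_{\cdot e}}^2 = E(X_{\cdot e} - X_{\cdot\cdot})^2$ is the "sampling variance"

$\sigma_{X_{c\cdot}}^2 = E(X_{c\cdot} - X_{\cdot\cdot})^2$ is the "enumerator variance"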
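The decomposition of the total variance into interaction, sampling variance, and enumerator variance can be verified on any complete enumerator-by-element array. The sketch below is illustrative only: the function name and the data are hypothetical, and the expectation E is taken as the simple average over all (c, e) combinations of a finite array.

```python
def variance_components(X):
    """Return (total, interaction, sampling, enumerator) for X[c][e]."""
    K, N = len(X), len(X[0])
    grand = sum(sum(row) for row in X) / (K * N)                  # X_..
    enum_mean = [sum(row) / N for row in X]                       # X_c.
    elem_mean = [sum(X[c][e] for c in range(K)) / K for e in range(N)]  # X_.e

    total = sum((X[c][e] - grand) ** 2
                for c in range(K) for e in range(N)) / (K * N)
    interaction = sum((X[c][e] - elem_mean[e] - enum_mean[c] + grand) ** 2
                      for c in range(K) for e in range(N)) / (K * N)
    sampling = sum((m - grand) ** 2 for m in elem_mean) / N
    enumerator = sum((m - grand) ** 2 for m in enum_mean) / K
    return total, interaction, sampling, enumerator


# Hypothetical 0-1 observations: 3 enumerators, 4 elements.
X = [[1, 0, 1, 1],
     [0, 0, 1, 0],
     [1, 1, 1, 0]]
total, interaction, sampling, enumerator = variance_components(X)
print(total, interaction + sampling + enumerator)
```

The two printed values agree, since the three deviations are mutually orthogonal over a complete array.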


B.   An  Experiment 


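Now consider an experiment with an interpenetrating sample, in which each of k enumerators interviews a different set of n persons, and form the estimate

$$\bar{x} = \frac{1}{kn} \sum_c^k \sum_e^n x_{ce}.$$

Then

$$E\,\bar{x} = X_{\cdot\cdot}.$$

[Figure: the interpenetrating sample marked in the enumerator-by-element array; each of the enumerators c, c', ... observes a different set of elements e, e', e'', ...]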

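Assume that the set (of size K) from which the enumerators are selected is infinite.

The total variance of $\bar{x}$ can be shown to be

$$\sigma_{\bar{x}}^2 = \frac{1 - n/N}{kn}\,S_I^2 + \frac{1 - kn/N}{kn}\,S_{X_{\cdot e}}^2 + \frac{1}{k}\,\sigma_{X_{c\cdot}}^2 \qquad (2)$$

where

$$S_I^2 = \frac{N}{N-1}\,\sigma_I^2, \qquad S_{X_{\cdot e}}^2 = \frac{N}{N-1}\,\sigma_{X_{\cdot e}}^2.$$

NOTE: When N is of the size of an Enumeration Area, and k = 1, we have the Census situation in a single EA. For a sampled item, with n = fN, and k = 1,

$$\sigma_{\bar{x}}^2 = \frac{1 - f}{n}\left(S_I^2 + S_{X_{\cdot e}}^2\right) + \sigma_{X_{c\cdot}}^2.$$

For a 100 percent Census item $\sigma_{\bar{x}}^2$ becomes $\sigma_{X_{c\cdot}}^2$, which is the enumerator variance.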


From this experiment we should like to obtain an estimate of $\sigma_{X_{c\cdot}}^2$, the enumerator variance. Before doing this, let us introduce additional notation, and write $\sigma_{\bar{x}}^2$ in another form.

C.   Additional  Notation 


Let

$$\sigma_R^2 = E_{c,e}\,(X_{ce} - X_{\cdot e})^2$$

and

$$\delta_N\,\sigma_R^2 = E_{c,e,e'}\left[(X_{ce} - X_{\cdot e})(X_{ce'} - X_{\cdot e'})\right].$$

Then $\sigma_R^2$ is the average (over persons) of the variance for a person (e), when enumerated by a single enumerator, around the average for all enumerators. $\delta_N$ is the intra-class correlation coefficient of the deviations of pairs of persons, and measures the increase in variance when several elements are enumerated by the same enumerator.
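Under the definitions above, the enumerator variance is tied to the response variance and the intra-class correlation by an exact algebraic identity: the variance of an enumerator's mean equals the response variance times [1 + (N-1) times the intra-class correlation], divided by N. The sketch below checks that identity on a small array. It is an illustration under stated assumptions: the function name and data are hypothetical, the expectations are simple averages over a finite array, and the pairs (e, e') are taken to be distinct elements.

```python
def check_intraclass_identity(X):
    """Return (enumerator variance, identity right-hand side) for X[c][e]."""
    K, N = len(X), len(X[0])
    elem_mean = [sum(X[c][e] for c in range(K)) / K for e in range(N)]
    # Deviations of each observation from the all-enumerator mean of its element.
    d = [[X[c][e] - elem_mean[e] for e in range(N)] for c in range(K)]

    sigma_R2 = sum(d[c][e] ** 2 for c in range(K) for e in range(N)) / (K * N)
    # Average product of deviations over pairs of distinct elements (assumption).
    cross = sum(d[c][e] * d[c][f]
                for c in range(K) for e in range(N) for f in range(N)
                if e != f) / (K * N * (N - 1))
    delta_N = cross / sigma_R2

    # Enumerator variance: variance of X_c. about X_.., i.e. of the mean deviation.
    enumerator_variance = sum((sum(row) / N) ** 2 for row in d) / K
    right_hand_side = sigma_R2 * (1 + (N - 1) * delta_N) / N
    return enumerator_variance, right_hand_side


X = [[1, 0, 1, 1],
     [0, 0, 1, 0],
     [1, 1, 1, 0]]      # hypothetical 0-1 observations
print(check_intraclass_identity(X))
```

For N large the factor [1 + (N-1) times the correlation]/N approaches the correlation itself, so the enumerator variance is then close to the product of the intra-class correlation and the response variance.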


We  find 


c. 


ti^-^n 


(3) 


and  for  N  large 


c. 


also 


so  that 

„2 


q2       „2 


w 


l-kn/N  c2     1   si 
kn    X    kn  R 


w 


D.   The  Estimate 


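First: Consider

$$\theta_1 = \frac{1}{k(k-1)} \sum_c^k (\bar{x}_c - \bar{x})^2$$

where

$$\bar{x}_c = \frac{1}{n} \sum_e^n x_{ce}.$$

Now

$$E\,\theta_1 = \frac{1 - n/N}{kn}\,S_I^2 + \frac{1}{kn}\,S_{X_{\cdot e}}^2 + \frac{1}{k}\,\sigma_{X_{c\cdot}}^2. \qquad (5)$$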


That is, $\theta_1$ is an overestimate of the total variance, to the extent of $S_{X_{\cdot e}}^2/N$, which is small.

Second: Let

$$\theta_2 = \frac{1}{kn} \cdot \frac{1}{k(n-1)} \sum_c^k \sum_e^n (x_{ce} - \bar{x}_c)^2.$$

This is an overestimate of the non-enumerator part of the variance (to the extent of $S_I^2/kN + S_{X_{\cdot e}}^2/N$):


E0Q  =  -L  S2  +  ^  S2   =  r^  S2  +  -£  (  1-5, 
2     kn  I   kn  X     kn  X    kn  I    N 
.  e       ce 


Third: Let $f = \theta_1 - \theta_2$.

(6) 
(7) 
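The three steps can be sketched in code as follows. This is only one reading of the estimate: the first quantity is taken as the variance among the enumerator means divided by k, the second as the pooled within-enumerator variance divided by kn, and f as their difference. The function name and the data are hypothetical, and equal sample sizes n for all k enumerators are assumed.

```python
def enumerator_variance_estimate(samples):
    """Return (theta_1, theta_2, f) for samples[c] = observations of enumerator c."""
    k = len(samples)
    n = len(samples[0])
    means = [sum(s) / n for s in samples]             # mean for each enumerator
    grand = sum(means) / k                            # overall sample mean

    # Between-enumerator part: sum of squared deviations of means / k(k-1).
    theta_1 = sum((m - grand) ** 2 for m in means) / (k * (k - 1))

    # Within-enumerator part: pooled sum of squares / [kn * k(n-1)].
    within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    theta_2 = within / (k * n * k * (n - 1))

    return theta_1, theta_2, theta_1 - theta_2


# Hypothetical 0-1 data: two enumerators, two elements each.
print(enumerator_variance_estimate([[1, 1], [0, 0]]))   # enumerators disagree
print(enumerator_variance_estimate([[1, 0], [1, 0]]))   # only within variation
```

In the first hypothetical case all the variation lies between enumerators and f is positive; in the second it lies entirely within enumerators and f is negative, which shows that the estimate is not constrained to be non-negative.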


Then  f  is  an  underestimate  of  a2  /k,  and  can 

X 
c. 

be  shown  to  be 


E(f) 


3N°R 


X     S 
c. 


2 
_I 

kN' 


(B) 


For  N  large,    kE(f)   = 




E.   The  Form  of  f 

Before  working  with  the  variance  of  f, 
note  that,  since  K  is  infinite 


E  X  X 


cM' 


ce  c'e 


EX  EX,, 
ce  ,  c'e 
c   c ' 


X     X 


e    .e 


and  there  are  11  distinct  terms  with 
appropriate  coefficients. 

2.  Square  also  (&  a2),  and  obtain  21 
terms,  also  fourth  powers,  in  the  same 
variables . 


so  that  8  a2  may  be  written  as 
N  R 


5an2  =  E/X  X   ,-X  X  , 
MR       ce  ce'   ce  c'e' 


that  is 


EX  X   .  -  EX  X 


E(f) 


ce  ce '     ce  c'e 


'o  i 


Consequently,  f  must  be  of  the  form 
K, (Terms  in  x  x   ,)  -  K  (Terms  in  x  x  ,  ,), 


where  small  letters  indicate  sample  observa- 
tions, and  indeed  f  =  0-0   can  be  shown  to  be 
of  this  form: 


k     n 

kf       =      - i r-y  Z      Z      X      X 

knln-lj  /    ,    ce  ce' 

'    c   e;fce ' 


3-  Subtract,  with  proper  coefficients. 
k.     Simplify  certain  types  of  terms. 


For  example 


E 


X2  X2 


=  E  fEX^ 
/  ,  ce  ce '       Ice 
c,e^e'  c  \e 


E  EX 
c  e 


ce 


5.  There  are  26  different  terms  left  in 
the  expression  for  cr2  which  may  be  written 


k2a| 


1  f(n-2)(n-3)  /l  *  y4 

k{  n(n-l)   (k^  c. 


K 
Z  X' 


C.2 


2  /n-1 
k-1  (  n 


+  k  4^^  flzx2  Z  X2 

n(n-l)  IKN    c    ce 


1 


k    n 


k(k-l)n2   /  ,   /  ,  ce  c'e' 
c^=c '  e^e ' 


(9) 


and 


Vi 


K  N 

kE(f)   =  t_t/,t  nN  Z  Z  X  X   , 
'  KM  M-l  )       ,  .  ne  ce' 


KN(N-l) 
1 


>    ,  ce  ce' 
e^e ' 


c  e^e 
K   N 


Z   Z  X  X 


K2N(N-1)  o,c'  e^e'  ce  c  e 

(10) 


F.   Variance  of  f 


To  evaluate 


.  =  k2a2  =  E(kf)2  -  (5,To2") 


if 


2^2 
NWR* 


we  take  the  following  steps : 

1.  Square  (kf)  as  in  Eq.  (9)>  above,  and 
take  the  expected  value.  We  get  fourth  powers 
in  10  variables: 


X   •  X   i,X   ,  X   , 
ce   ce    ce"   ce ' 


Xc'e"  Xc'e'"  Xc'e'  '  ' 
Xc"e"'  Xc"e' ' ' 


c'  *  'e"  ' 


K 
Z  X' 


^^TT 


c . 


K  N 

EZX' 


)  C  ™ 


ce 


+  2 


T 

1 cV 

n(n-l) 


Z  X* 

ce  . 
e    J 


+   2 


K  N 
Z  Z  ) 
c  e 


n2(k-l) 


cev2 


KN 


plus  Terms  of  order  (—  ]  and  higher 


(11) 


G.   Simplification  of  a 


f 


Now consider f and E(f), as given by Eqs. (9) and (10). These equations will be unchanged if we take deviations around row means $(X_{\cdot e})$, i.e. replace $x_{ce}$ by $x_{ce} - X_{\cdot e}$, $x_{ce'}$ by $x_{ce'} - X_{\cdot e'}$, $X_{ce}$ by $X_{ce} - X_{\cdot e}$, etc.

If K were finite, there would also be terms of order (1/K).


Then,  in  Eq.  (ll)  we  replace 


i  K  1  K 


which  we  call 


similarly 


^:X 
c. 


K  N 
Z  Z  ) 
c  e 


ce 


is  replaced  by 


KN 


K  N 

Z  Z  (X     -X 

ce      .e 
c  e   ' 


KN  ' 


which   is   cVr. 
K 


If  we   also  assume   that  N  is   large,    so  that 

terms    in  (l/N)    can  be   dropped,    and  replace 

n-1     n-2  ,       -  , 

,   — -,    etc.    by  1,    we  have 

n   '    n-1' 


•,22      •      If  k-3     4 

k  af     =     kK:X     "  —  GX 
I         c.  c. 


1 

+  — 

n 


Z  (X     -X 

c   •    C' 


2   N 

Z  (X     -X 
ce 


KN 


1  2  2 

+  n  ax  ctr 

c. 


+  nTTTTT 


K    fN 

Z  Jz  (X     -X 
\c\e[    Ce 


)T 


KN' 


Then 


K  2  N  2 

Z  f X_    -X       )    Z  ( X    .-X    ^  ] 


c. 


ce      .  e  . 
e  \  / 


KN 


1+7 


K 
Z 

2         \    C 

X 
•  e 


(x-- 


and 


KN2 

K  / 

\4 

Z  (X      -X 
\         .e/ 

J 

So  that 

k8°£  ^i^ 


1+  -  A  l+V. 


C.L 


n  I        X 


,e 


f: 


1+7," 


nTn^TT  (^\ 


Ma4 

k-1    x 


k-l 


c. 

2  2 

■°X      aR 
c        +   2 


CC 


n 


nU-1) 


and  since 


E(kf)      =     5Na2     =     a*     , 

c. 


s j  \  K 1 1+ 1  (1+vx  ) + ot 


c. 

k-l       k-l 


.e 


2-i 


n5 


n       n(n-l)62_ 


.e/   J 
(13) 


Now,    for  nS„  large,    the  first  two  terms   domi- 

nate,    and  for  72       small  relative  to  n,   we  have 
X 


.e 


T. 


3        -^i 
X  k-l 

c. 


1        4 


(12) 


To  approximate  the  third  and  fifth  terms, 
assume 


X 


X  X 
c.  .e 


ce      X 


This is the usual formula for the rel-variance of the estimate of the variance for simple random sampling. This indicates that the variance of f can be separated into pieces to isolate the contribution due to enumerators, just as f itself can be split into pieces, to separate sampling variance, enumerator variance, and interaction.
H. Average over Several EA's

Suppose  independent  experiments  are 
conducted  in  a  number  (h)  of  Enumerator 
Assignments,  and  let 


e. 


and 


Then 


-  k.f.,  with  E© 


A  A 


A 


=  \\ 


a2X 


A  c. 


e 


e  .  i  l  e 


1  /  2 

—  (  average  of  vr 


A 


1  h 

-  y   k2v2 

h  A  AV 


I.   Size  of  Experiment  Needed  to  give 
Acceptable  Reliability 

Suppose we have two enumerators interpenetrated in each of 100 EA's each of size 200 elements; and that $V_{X_{\cdot e}}^2$ (which is less than the total $V_X^2$) is not more than nine, so that $1 + V_{X_{\cdot e}}^2 \le 10$. This last condition is fulfilled for a (0,1) variate if the proportion of ones in the population is 10 percent or greater.

The value of n, under census conditions, will be N/2 = 100; $\delta_N$ may be expected to be larger than 1/N, in most practical cases, so that $n\delta_N \ge .5$. The value of $\beta_{X_{c\cdot}}$, over enumerators, is not known; the following table shows the value of $V_\Theta$ for several possible values of $\beta_{X_{c\cdot}}$.

Reliability of estimate $\Theta$ of enumerator variance over 100 EA's, when within each EA:


N = 200,  k = 2,  n = 100,  $V_{X_{\cdot e}}^2 = 9$


200 


V2  =  ^Jb        (1.42)  +  1  + 


c. 


1005, 


(1005N)2 


                 Coefficient of Variation of $\Theta$
$\beta$      $n\delta_N = .5$     $n\delta_N = 1$     $n\delta_N$ large
   3              .28                  .20                  .16
   7              .32                  .26                  .23
  15              .40                  .36                  .33
 100              .88                  .85                  .85

A value of $\beta$ between 3 and 15, with $n\delta_N = 1$, is typical of many situations; so that with 100 EA's (or strata) of the type described above, an estimate f, made according to Eq. (7) or (9), will have a reliability of about 40 percent; and in many cases this will be a conservative statement.

J.   Notes  and  Extensions 

(1) The effect of interpenetrating more enumerators will, in general, reduce the numbers in the table somewhat. For example, if there are four enumerators, each enumerating 50 elements, the figures in the column for "$n\delta_N$ large" will be multiplied by $1/\sqrt{2}$. However, for $\beta$ small and $n\delta_N$ small, the figure will change only slightly, and may even be increased.

(2) A larger value of $V_{X_{\cdot e}}^2$ will increase all entries in the table. For example, if $V_{X_{\cdot e}}^2 = 99$ (instead of 9) the numbers in the table will be about twice as large.

(3) For a sample survey (or the sample questions on a census) the entries in the table would be increased. For example, on the 25 percent items in a census, the two enumerators would each enumerate only 25 cases (compared with 100 of the 100 percent items). For items for which $\delta_N$ is large, so that $25\,\delta_N$ is still greater than unity, the entry might be about $\sqrt{2}$ times as large as before; while for $\delta_N$ small, the new entry might be twice the old one.
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( h)   Other  questions  of  interest  which 
should  be  examined  are: 

(a)  The  effect  of  cost  factors  on  the 
computation. 

(b)  The  effect  of  variable  sizes  of 
EA's  (enumerators*  assignments), 
and  how  to  minimize  their  effect. 
The  estimates  used  in  this  memo- 
randum are  simple  unbiased 
estimates — can  the  results  be 
extended  to  ratio  estimates? 

(c)  The  effect  of  unequal  penetration: 
For  example,  one  enumerator  does 
80  percent  of  the  work,  and  the 
second  does  20  percent. 


(d)  The extension to more than one
level of interpenetration: for
example, the superposition of
randomization of crew leaders.

(e)  The  combination  of  interpenetration 
with  repetition  of  some  phases  of 
the  Census  operation — for  example, 
duplicate  processing  to  determine 
the  effect  of  processor  errors. 

(f)  The  separation  of  the  components 

0  and  0  of  f,  to  assist  in  arriv- 
ing at  the  best  possible  experi- 
mental design. 


VARIANCE  OF  AN  ESTIMATE  OF  ENUMERATOR  VARIABILITY,  USING  A  REPLICATED  SAMPLE 

Margaret  Gurney 

C.  The  Estimate 


Summary 

This  note  deals  with  the  same  population 
as  that  described  in  the  preceding  note.  Here, 
instead  of  an  interpenetrating  sample  of  size 
2n,  a  sample  of  n  elements  is  enumerated  twice, 
by  two  different  enumerators.   "Enumerator 
variance"  is  defined,  and  an  estimator  of  it  is 
obtained;  an  approximation  to  the  variance  of 
the  estimator  is  derived. 

It is found that a sample of 100 elements,
each enumerated twice, in each of 100 strata,
will produce, for a (0,1) variate which is 10
percent or more of the population, an estimate
of enumerator variability with a reliability of
40 percent or better. This is of the same
order as the result of the earlier note.

In  both  cases,  the  estimate  is  of 
enumerator  variance;  for  the  standard  error 
(in  each  case),  the  reliability  would  be  of  the 
order  of  20  percent  or  better. 

A.   The  Model 

The population consists of N elements,
each enumerated by each of K enumerators, and
the value of the e-th element, as observed by
the c-th enumerator, is $X_{ce}$. K is assumed to be
infinite. $X_{c\cdot}$, $X_{\cdot e}$, and $X_{\cdot\cdot}$ are defined as
before, and the population variance (for a
sample of one element) is

$$\sigma^2_X = \sigma^2_I + \sigma^2_{X_{\cdot e}} + \sigma^2_{X_{c\cdot}}$$

where $\sigma^2_I$ is the "interaction," $\sigma^2_{X_{\cdot e}}$ is the
"sampling variance," and $\sigma^2_{X_{c\cdot}}$ is the
"enumerator variance."

B.  An  Experiment:   Replication 

From the population a sample of n elements
is selected, and each of these is enumerated by
two enumerators. That is, the sample is
replicated k times, with k = 2 (for example,
duplicate coding or editing of schedules).


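Let

$$\Delta_e = X_{ce} - X_{c'e} = Z_{ce} - Z_{c'e}$$

where

$$Z_{ce} = X_{ce} - X_{\cdot e};$$

and

$$\bar{\Delta} = \frac{1}{N}\sum_{e}^{N}\Delta_e,$$

and let

$$\bar{\Delta}' = \frac{1}{n}\sum_{e}^{n}\Delta_e = \frac{1}{n}\sum_{e}^{n}(Z_{ce} - Z_{c'e}).$$

The estimate

$$\hat{\phi} = \frac{1}{2}\left\{\bar{\Delta}'^{\,2} - \frac{N-n}{Nn}\cdot\frac{1}{n-1}\sum_{e}^{n}(\Delta_e - \bar{\Delta}')^2\right\} \qquad (1)$$

is an unbiased estimate of

$$\sigma^2_{X_{c\cdot}} = \frac{1}{K}\sum_{c}^{K}(X_{c\cdot} - X_{\cdot\cdot})^2,$$

the so-called "enumerator variance."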

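With some algebraic manipulation we find

$$\hat{\phi} = \frac{1}{2nN}\left\{\sum_{e}^{n}\Delta_e^2 + \frac{N-1}{n-1}\sum_{e\neq e'}^{n}\Delta_e\Delta_{e'}\right\} \qquad (2)$$

and

$$E_{c,c';e}(\hat{\phi}) = \frac{1}{2N^2}\,E_{c,c'}\Big(\sum_{e}^{N}\Delta_e\Big)^2 = \frac{1}{2}\,E_{c,c'}(Z_{c\cdot} - Z_{c'\cdot})^2 = \sigma^2_{X_{c\cdot}}. \qquad (3)$$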
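D.   Variance of $\hat{\phi}$

Using $\hat{\phi}$ as given by Eq. (2) we find

$$\hat{\phi}^2 = \frac{1}{4n^2N^2}\left\{\Big(\sum_{e}\Delta_e^2\Big)^2 + \Big(\frac{N-1}{n-1}\Big)^2\Big(\sum_{e\neq e'}\Delta_e\Delta_{e'}\Big)^2 + 2\,\frac{N-1}{n-1}\Big(\sum_{e}\Delta_e^2\Big)\Big(\sum_{e\neq e'}\Delta_e\Delta_{e'}\Big)\right\}. \qquad (4)$$

The variance of $\hat{\phi}$ is defined by

$$\sigma^2_{\hat{\phi}} = E\,\hat{\phi}^2 - \sigma^4_{X_{c\cdot}}. \qquad (5)$$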
When we write out the terms in $\hat{\phi}^2$, as given by
Eq. (4), and take expected values over c's and
e's, we end up with 12 distinct terms, with
coefficients in (n) and (N). For N large, the
dominating terms are


,2 


.  1  fc  \e 


<5   2n 


Z  fZ  Z' 


ce 


KN' 


& 


Z  Z' 


c. 


K     N 

Z  Z2  Z  Z2 

c   C'  e   CG  /,,  n-2 

KN      V  n-X 

N 

Z  fZ  2 


K 
K  N 


(n-2)(n-3)  +  e  Vc 
n-1 


c.   ce 


K2N 


,n-2' 

n-1 


K 


Z  Z  Z2  Z  Z2 


K  K  N 

Z  Z2  Z  Z  Z2 

c.      ce , 

c  c  e    A.  n-2 


Cx2 


;(n-2)(n-3) 
3 nil 


K     KN 

N   K 

Z  f  Z  Z   Z 
,1    ce  ce 
e=e'\c 


^ 


K2N(N-l) 
E.   Simplification of $\sigma^2_{\hat{\phi}}$


n-1 


(6) 


c. 


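There are several terms in $\sigma^2_{\hat{\phi}}$ which we
cannot readily evaluate. In order to get some
notion of its magnitude, let us assume

$$X_{ce} = \frac{X_{c\cdot}X_{\cdot e}}{X_{\cdot\cdot}}.$$

Then

$$Z_{ce} = \frac{X_{c\cdot}X_{\cdot e} - X_{\cdot e}X_{\cdot\cdot}}{X_{\cdot\cdot}} = \frac{X_{\cdot e}}{X_{\cdot\cdot}}\,Z_{c\cdot}$$

and the first term in $\sigma^2_{\hat{\phi}}$ becomes

$$\frac{\sum_{c}^{K}\big(\sum_{e}^{N}Z_{ce}^2\big)^2}{KN^2} = E_c\,Z_{c\cdot}^4\,\big(1+V^2_{X_{\cdot e}}\big)^2 = \mu_{4:X_{c\cdot}}\big(1+V^2_{X_{\cdot e}}\big)^2.$$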

Similar  substitutions  into  the  other  terms  of 
Eq.  (6)  lead  to 


.2 


1 


%     =  2n(n-l)  1^U:X 


c. 


.e 


c.  \    .e/ 

+  (n-2)(n-3)^u.x  +  8(n-2)a4  A+V2  "\ 
c.         c.\    ,e/ 


+  2a4+3(n-2)(n-3)o4  +Hn-2)  a^a 


2_2 


C. 


R  X 


c. 


+  Ua4 

c. 


.e 


(7) 


c. 


where 


E  (X  -X 
\  ce  .e 
c,e 


1+V2    =  E  ( X  /X 
X        \  .e 

.e     e 


l4:X 


c. 


E  (X     -X 


(8) 


JJ    ^  E  Xc."X 
c.     c 


If  we  replace 
divide  a2  by 


and  define 


n-1   n-2 
n  '  n-1 


,  etc.  by  1,  and 


a 


*2„4 

=  Vr 
c. 


=  Hi 


/c4 


*k:X     /UX 
c.        c.   c. 


we  find 


)        v^H\I^He)  +  ^ir(^e)] 


'nf  +  - 


N   n(n-l)62 


L+V2  ^  +  -, ty  A+V2 


(9) 


F.  Size of Experiment Needed to Give
Acceptable Reliability

As before, consider an experiment in which
the estimate $\hat{\phi}$ (Eq. 1) is averaged over 100
strata. The following table shows, for N = 200,
k = 2, n = 100, $V^2_{X_{\cdot e}} = 9$, the reliability of the
average estimate, $\bar{\phi}$, for several possible
values of $\beta_{X_{c\cdot}}$ and $n\delta_N$.
.  l


200  1 ^X 


PY  (1.42)  +  1  + 


c. 


_8o    hoo 


1005 


N   (1005N)2 


100   10,000 


I  c.  N   (n5N) 


Coefficient of variation of $\hat{\phi}$:  n = 100

  $\beta$    $n\delta_N$ = .5   $n\delta_N$ = 1   $n\delta_N$ = 2   $n\delta_N$ = 5   $n\delta_N$ large

    3        .33         .25         .21         .19         .17
    7        .37         .30         .27         .25         .24
   15        .44         .38         .36         .35         .34
  100        .89         .87         .86         .85         .85

These numbers are of the same order as those
presented in the preceding note (see pg. 13). A
similar conclusion may be drawn: that a sample
of 100 cases from 100 EA's, with two enumerators
(or processors) handling each case, will
lead to an estimate $\bar{\phi}$, with a reliability of
about 40 percent; and this is a conservative
statement.


Remark 1:  $\hat{\phi}$ is an estimate of a variance
($\sigma^2_{X_{c\cdot}}$). For the standard error, $\sigma_{X_{c\cdot}}$, the
reliability of the estimate is about twice as
great.¹ That is, $\sqrt{\hat{\phi}}$ is an estimate of $\sigma_{X_{c\cdot}}$
which has a fairly high probability of being
within 20 percent of the "true" value, $\sigma_{X_{c\cdot}}$.
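Remark 1 rests on the usual first-order approximation that taking a square root roughly halves the coefficient of variation. The simulation below is a minimal sketch of that point. The gamma distribution for the variance estimates, the seed, and the function names are assumptions made only for illustration; nothing in the note specifies a distribution.

```python
import random
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


def simulate_cv_pair(true_variance=1.0, cv=0.40, draws=20000, seed=1):
    """Draw variance estimates with a given CV and compare with their square roots."""
    rng = random.Random(seed)
    # A gamma variate with shape k has coefficient of variation 1/sqrt(k).
    shape = 1.0 / cv ** 2
    scale = true_variance / shape
    estimates = [rng.gammavariate(shape, scale) for _ in range(draws)]
    roots = [v ** 0.5 for v in estimates]
    return coefficient_of_variation(estimates), coefficient_of_variation(roots)


cv_variance, cv_std_error = simulate_cv_pair()
print(round(cv_variance, 2), round(cv_std_error, 2))
```

With a 40 percent coefficient of variation for the variance estimate, the simulated coefficient of variation of its square root comes out near 20 percent, in line with the remark.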


Remark 2:  The sample, n = 100, within
each stratum is rather large. If its size
is reduced considerably, the error in the
estimate $\bar{\phi}$ will be increased. For example,
with 100 strata: keeping N = 200, $V^2_{X_{\cdot e}} = 9$,
and taking $\delta_N = .02$, the table becomes


¹Hansen, Hurwitz, and Madow: Sample Survey
Methods and Theory, Vol. II, page 102.



Coefficient of variation of $\hat{\phi}$:  $\delta_N$ = .02

  $\beta$    n = 4    n = 10    n = 25    n = 50    n = 100

    3       1.74      .74       .33       .27       .21
    7       1.89      .84       .46       .33       .27
   15       2.16      .99       .57       .44       .36
  100       4.07     2.01      1.23       .99       .86

This table suggests that if a small sample
is taken within each stratum (or EA) then a
larger number of strata must be used in order
to make $\hat{\phi}$ sufficiently reliable to be useful.
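To make the trade-off concrete, the sketch below converts a tabled coefficient of variation (computed for 100 strata) into the number of strata needed for a target reliability. It assumes the strata estimates are independent with equal variance, so that the coefficient of variation of the average falls as one over the square root of the number of strata; the function name is hypothetical.

```python
import math


def strata_needed(cv_at_reference, target_cv, reference_strata=100):
    """Strata required for target_cv, given the CV tabled at reference_strata.

    Assumes the CV of the average scales as 1 / sqrt(number of strata).
    """
    return math.ceil(reference_strata * (cv_at_reference / target_cv) ** 2)


# Tabled value .99 (beta = 15, n = 10); target reliability of 40 percent.
print(strata_needed(0.99, 0.40))  # -> 613
```

On this reckoning, a within-stratum sample of 10 with $\beta = 15$ would need roughly six times as many strata as the 100 assumed in the table to reach a 40 percent reliability.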

G.   Comparison  with  Result  for  Interpenetrating 
Sample 

The  result  in  Eq.  (9)  is  very  close  to  that 
found  in  Eq.  (13)  of  the  preceding  note.  The 
sample  size  in  each  case  is  kn,  i.e.,  the 
number  of  enumerators  times  the  sample  assigned 
to  each. 

First we note that Eq. (13) deals with the
variance of f around its expected value [$\delta\sigma_R^2$]
and contains the approximation $\sigma^2_{X_{c\cdot}} = \delta_N\sigma_R^2$. In
order to compare $V_f^2$ with the result in Eq. (9),
above, we should use the Mean Square Error of f,
around $\sigma^2_{X_{c\cdot}}$. From Eq. (8) of the preceding
note:


n2    _ 

MSE(f>  -  °f  +  (§y 


2  ,  °R4(1-V; 
a„  + — 

1  i,2M2 


1TN' 


If we take k = 2, and do NOT make the
approximation

$$\sigma^2_{X_{c\cdot}} = \delta_N\sigma_R^2,$$

Eq. (13) of the previous note becomes


MSE(f)    ._   1 
'.  -   2 


0. 


1+  Vl+V? 
nl        X 


,e 


2 
n( n-l) 


i+v; 


r   0 


+  1  + 


Lno 


+  nlnTTy 


X 


]} 


c. 


(10) 


c. 


^Enumerator  Assignment. 
If, moreover, we do NOT make the approximation
in Eq. (9), we find the difference between the
two rel-variances to be


MSE(f) 


13  °R 

2 


n  2    n 

ax 

G 


+  -  (  1+V. 


1+Yf 


nTT^iy    ^VX 


R  f  1-5. 


c . 

In Eq. (11) the last term, which is
negative, is approximately


^N2  o* 

A. 


(11) 


this correction is introduced since we must
use MSE(f) instead of $\sigma_f^2$. For N large it is
small and can be neglected.

Eq. (11) measures the loss in efficiency
in going back to a sample of size n with a
second enumerator, rather than interpenetrating
a total sample of size 2n with two enumerators.
For N large, the loss is approximately


r2  lr2 


Vf-V„  =  L' 


=  i/4- +  2 


l+V" 


Un6N   »V\e 


k 


which is independent of $\beta_{X_{c\cdot}}$. The relative
sizes of $(1+V^2_{X_{\cdot e}})$ and $1/\delta_N$
determine which term dominates.


AN ESTIMATOR OF THE BETWEEN-VARIABLE COVARIANCE
OF RESPONSE DEVIATIONS

Leon Pritzker

NOTATION

$x_{jt}$ = 1 if individual j is classified
in the category of interest on trial t
= 0 if otherwise

$X_{j\cdot}$ = $E(x_{jt})$ where the expectation is
taken over all possible trials
(including all possible samples
and orders of enumeration)

$X_{\cdot\cdot}$ = $E\,X_{j\cdot}$

$dx_{jt}$ = $x_{jt} - X_{j\cdot}$ = response deviation

$\sigma^2_{dx}$ = $E\,E(dx_{jt})^2$ = simple response
variance

$I_x$ = $\sigma^2_{dx} / \sigma^2_{x_{jt}}$ = index of
inconsistency.

COVARIANCES PREVIOUSLY DEFINED FOR A SINGLE
VARIABLE²

A.  Between-trial covariance of response
deviations

$$E(dx_{jt}\,dx_{jt'}).$$

B.  Within-trial covariance of response
deviations

$$E(dx_{jt}\,dx_{j't}).$$

BETWEEN-VARIABLE COVARIANCE OF RESPONSE
DEVIATIONS

Define the 0-1 variate, $y_{jt}$. Then, with
the same notation for $y_{jt}$ as for $x_{jt}$, the
covariance is defined as:

$$E(dx_{jt}\,dy_{jt}).$$

ROLE OF RESPONSE ERROR IN ESTIMATION OF
CORRELATION COEFFICIENTS

Define the between-variable correlation
of response deviations

$$\rho_{dxdy} = \frac{E(dx_{jt}\,dy_{jt})}{\sigma_{dx}\,\sigma_{dy}}. \qquad (1)$$

Express the correlation between variables
estimated from paired observations on a single
trial as

$$\rho_{x_{jt},y_{jt}}.$$

What is desired to be estimated is the
correlation between variables based on the
expected values of the observations,

$$\rho_{X_{j\cdot},Y_{j\cdot}}.$$

It can be shown, for samples sufficiently
large, that:

$$\rho_{x_{jt},y_{jt}} = \rho_{X_{j\cdot},Y_{j\cdot}}\,[(1-I_x)(1-I_y)]^{1/2} + \rho_{dxdy}\,[I_x I_y]^{1/2}. \qquad (2)$$

Thus:

$$\rho_{X_{j\cdot},Y_{j\cdot}} = \frac{1}{[(1-I_x)(1-I_y)]^{1/2}}\Big(\rho_{x_{jt},y_{jt}} - \rho_{dxdy}\,[I_x I_y]^{1/2}\Big). \qquad (3)$$

Estimators of $I_x$ and $I_y$ are described
in reference [1]. Our task is to provide an
estimator for $\rho_{dxdy}$.


¹The underlying model is described in
references [1] and [2].

²See reference [2].
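The role of the response-deviation correlation in Eq. (3) can be illustrated with a short sketch. It assumes Eq. (2) and Eq. (3) in the form of an attenuation by the square root of $(1-I_x)(1-I_y)$ plus a term in the square root of $I_x I_y$; the function names and the numerical values are hypothetical.

```python
import math


def observed_correlation(rho_true, index_x, index_y, rho_dxdy):
    """Eq. (2): single-trial correlation implied by the expected-value correlation."""
    return (rho_true * math.sqrt((1 - index_x) * (1 - index_y))
            + rho_dxdy * math.sqrt(index_x * index_y))


def corrected_correlation(rho_observed, index_x, index_y, rho_dxdy):
    """Eq. (3): recover the expected-value correlation from the observed one."""
    if not (0 <= index_x < 1 and 0 <= index_y < 1):
        raise ValueError("indexes of inconsistency must lie in [0, 1)")
    attenuation = math.sqrt((1 - index_x) * (1 - index_y))
    return (rho_observed - rho_dxdy * math.sqrt(index_x * index_y)) / attenuation


# Hypothetical values: true correlation .5, indexes .2 and .3, rho_dxdy = .1.
rho_obs = observed_correlation(0.5, 0.2, 0.3, 0.1)
print(round(rho_obs, 3), round(corrected_correlation(rho_obs, 0.2, 0.3, 0.1), 3))
```

The round trip returns the starting correlation, which shows why an estimate of $\rho_{dxdy}$ is needed before the observed correlation can be corrected.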
THE OBSERVATIONS

Two independent trials, using the same
procedure, are made on the same sample of n
individuals selected with equal probability.
Classifications are obtained on each trial
for two variables, x and y. The classifications
are distributed as follows:


                                          y
                   First trial (t):       1     1     0     0
                   Second trial (t'):     1     0     1     0
         x
 First trial   Second trial
     (t)           (t')
      1             1                     a     b     c     d
      1             0                     e     f     g     h
      0             1                     i     j     k     l
      0             0                     m     n     o     p

where a, ..., p are the proportions of the
total sample in each cell.

THE  ESTIMATOR  OF  THE  BETWEEN-VARIABLE 
COVARIANCE  OF  RESPONSE  DEVIATIONS,  AND 
ITS  DERIVATION 

The estimator is:

$$\phi' = \tfrac{1}{2}[(f+k)-(g+j)]. \qquad (4)$$

The  derivation  is  as  follows: 

An intuitive starting point for the
covariance is to write the estimator:

$$\phi' = \frac{1}{2n}\sum_{j=1}^{n}\big(x_{jt}y_{jt} - x_{jt}y_{jt'} - x_{jt'}y_{jt} + x_{jt'}y_{jt'}\big) \qquad (5)$$

$$= \tfrac{1}{2}[(a+b+e+f)-(a+e+c+g)-(a+b+i+j)+(a+c+i+k)]$$

$$= \tfrac{1}{2}[(f+k)-(g+j)].$$
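The reduction from the sum in (5) to the four cells f, k, g, j can be checked numerically. The sketch below assumes the table layout in which rows are the (first trial, second trial) patterns of x and columns are the corresponding patterns of y, each in the order (1,1), (1,0), (0,1), (0,0); the function names and data are hypothetical.

```python
from collections import Counter

CELL_LETTERS = "abcdefghijklmnop"
PATTERNS = [(1, 1), (1, 0), (0, 1), (0, 0)]  # (first trial, second trial)


def cell_proportions(x_t, x_t2, y_t, y_t2):
    """Proportions a..p of the 4 x 4 cross-classification (rows x, columns y)."""
    n = len(x_t)
    counts = Counter(zip(x_t, x_t2, y_t, y_t2))
    cells = {}
    for row, (xa, xb) in enumerate(PATTERNS):
        for col, (ya, yb) in enumerate(PATTERNS):
            cells[CELL_LETTERS[4 * row + col]] = counts[(xa, xb, ya, yb)] / n
    return cells


def phi_prime_from_cells(cells):
    """Estimator (4): one half of (f + k) - (g + j)."""
    return 0.5 * ((cells["f"] + cells["k"]) - (cells["g"] + cells["j"]))


def phi_prime_direct(x_t, x_t2, y_t, y_t2):
    """Estimator (5): the summand factors as (x_t - x_t')(y_t - y_t')."""
    n = len(x_t)
    return sum((a - b) * (c - d) for a, b, c, d in zip(x_t, x_t2, y_t, y_t2)) / (2 * n)


# Hypothetical classifications of six individuals on two trials.
x_t, x_t2 = [1, 1, 0, 0, 1, 0], [1, 0, 1, 0, 0, 1]
y_t, y_t2 = [1, 1, 0, 0, 0, 1], [1, 0, 1, 1, 1, 1]
cells = cell_proportions(x_t, x_t2, y_t, y_t2)
print(phi_prime_from_cells(cells), phi_prime_direct(x_t, x_t2, y_t, y_t2))
```

Only individuals whose x classification and y classification both change between trials contribute, which is why twelve of the sixteen cells drop out.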


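Now:

$$E(\phi') = \tfrac{1}{2}E\,(x_{jt}-x_{jt'})(y_{jt}-y_{jt'})$$

$$= \tfrac{1}{2}E\big([x_{jt}-X_{j\cdot}]-[x_{jt'}-X_{j\cdot}]\big)\big([y_{jt}-Y_{j\cdot}]-[y_{jt'}-Y_{j\cdot}]\big)$$

$$= E(dx_{jt}\,dy_{jt}),$$

assuming independence between trials.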

AN  ALTERNATIVE  ESTIMATOR  OF  THE  BETWEEN- 
VARIABLE  COVARIANCE  OF  RESPONSE  DEVIATIONS, 
AND  ITS  DERIVATION 

Mr. Bershad has suggested the
possibility of using the following estimator
(in the notation used above):

$$\phi'' = (e+f)-(i+j). \qquad (6)$$

He starts with:

$$\phi'' = \frac{1}{n}\sum_{j=1}^{n}(x_{jt}-x_{jt'})\,y_{jt} \qquad (7)$$

$$= \frac{1}{n}\sum_{j=1}^{n}(x_{jt}y_{jt} - x_{jt'}y_{jt})$$

$$= (e+f)-(i+j).$$
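The alternative estimator can be checked the same way. This sketch assumes, as in the table above, that e + f is the proportion with x = 1 on the first trial, x = 0 on the second, and y = 1 on the first trial, and that i + j is the proportion with the x pattern reversed; the function names and data are hypothetical.

```python
def phi_double_prime_direct(x_t, x_t2, y_t):
    """Estimator (7): mean of (x_t - x_t') * y_t over the sample."""
    n = len(x_t)
    return sum((a - b) * c for a, b, c in zip(x_t, x_t2, y_t)) / n


def phi_double_prime_from_counts(x_t, x_t2, y_t):
    """Estimator (6): (e + f) - (i + j), computed from pattern proportions."""
    n = len(x_t)
    triples = list(zip(x_t, x_t2, y_t))
    e_plus_f = sum(1 for t in triples if t == (1, 0, 1)) / n
    i_plus_j = sum(1 for t in triples if t == (0, 1, 1)) / n
    return e_plus_f - i_plus_j


# Hypothetical classifications of four individuals.
x_t, x_t2, y_t = [1, 1, 0, 1], [0, 0, 1, 1], [1, 1, 0, 1]
print(phi_double_prime_direct(x_t, x_t2, y_t),
      phi_double_prime_from_counts(x_t, x_t2, y_t))
```

Unlike the estimator in (5), this one uses the y classification from the first trial only.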

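$\phi''$ is an unbiased estimator, i.e.:

$$E(\phi'') = E(dx_{jt}\,dy_{jt}).$$

The proof is as follows:

$$E(\phi'') = E\,(x_{jt}-x_{jt'})\,y_{jt}$$

$$= E\big([x_{jt}-X_{j\cdot}]-[x_{jt'}-X_{j\cdot}]\big)\big([y_{jt}-Y_{j\cdot}]+Y_{j\cdot}\big)$$

$$= E(dx_{jt}\,dy_{jt}) - E(dx_{jt'}\,dy_{jt}) + E(dx_{jt}\,Y_{j\cdot}) - E(dx_{jt'}\,Y_{j\cdot}).$$

Since

$$E(dx_{jt'}\,dy_{jt}) = 0$$

by the hypothesis of independent trials and

$$E(dx_{jt}\,Y_{j\cdot}) = E(dx_{jt'}\,Y_{j\cdot}) = 0,$$

$$E(\phi'') = E(dx_{jt}\,dy_{jt}),$$

which is the desired covariance.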


The estimators shown in (5) and (7)
above apply to continuous variates as well
as 0-1 variates.


MEMORANDUM  ON  THE  DISTRIBUTION  OF  ERRORS  BY  INTERVIEWERS 

Benjamin  J.  Tepping 


Attention  is  sometimes  given  to  the 
proportion  of  the  errors  made  in  a  survey  by 
the  worst  x  percent  of  the  interviewers.   For 
example,  it  may  be  stated  that  90  percent  of 
the  errors  were  made  by  only  50  percent  of  the 
interviewers. The question arises: to what
degree may such statements be taken to
indicate heterogeneity among the interviewers?
This  suggests  finding  what  distribution  of 
errors  would  result  if  in  fact  all  interviewers 
were  essentially  alike. 

Let each interviewer be assigned n
trials, and let p (a constant) be the
probability of an error on each trial. Then the
number of errors made by the interviewers will
have a Bernoulli distribution. The proportion
of interviewers making r errors or more is
then

$$\psi_r = \sum_{x=r}^{n}\binom{n}{x}p^x q^{n-x}$$

and the proportion of errors made by these
interviewers is

$$\phi_r = \frac{\sum_{x=r}^{n} x\binom{n}{x}p^x q^{n-x}}{\sum_{x=0}^{n} x\binom{n}{x}p^x q^{n-x}} = \frac{1}{np}\sum_{x=r}^{n} x\binom{n}{x}p^x q^{n-x}.$$

Define

$$\Delta\psi_r = \psi_r - \psi_{r+1}, \qquad \Delta\phi_r = \phi_r - \phi_{r+1},$$

so that

$$\Delta\psi_r = \binom{n}{r}p^r q^{n-r}, \qquad \Delta\phi_r = \frac{r}{np}\binom{n}{r}p^r q^{n-r} = \frac{r}{np}\,\Delta\psi_r.$$

Now $\psi_r$ and $\Delta\psi_r$ can be read from a table of the
Bernoulli distribution, and thus $\Delta\phi_r$ (and $\phi_r$)
can easily be calculated.
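In place of a printed table, the two proportions can be computed directly. The sketch below takes q = 1 - p and uses hypothetical function names; the values n = 10 and p = 0.1 are chosen only for illustration.

```python
from math import comb


def binomial_term(n, p, x):
    """Probability that an interviewer makes exactly x errors in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def psi(n, p, r):
    """Proportion of interviewers making r errors or more."""
    return sum(binomial_term(n, p, x) for x in range(r, n + 1))


def phi(n, p, r):
    """Proportion of all errors made by interviewers with r errors or more."""
    return sum(x * binomial_term(n, p, x) for x in range(r, n + 1)) / (n * p)


n, p, r = 10, 0.1, 2
print(round(psi(n, p, r), 4), round(phi(n, p, r), 4))  # -> 0.2639 0.6126
```

In this illustration about 26 percent of the interviewers account for about 61 percent of the errors, even though every interviewer has the same error rate.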

Graph I shows the proportion of errors
($\phi$) made by the worst $\psi$ percent of the
interviewers, for selected values of p and n.

It is interesting to consider the case
in which there is a mixture of interviewers,
with the proportion $\pi$ of the interviewers
having the intrinsic error rate $p_1$, and the
proportion $1-\pi$ having the intrinsic error
rate $p_2$. Then the proportion of interviewers
making r errors or more is

$$\psi_r = \pi\psi_{1r} + (1-\pi)\psi_{2r}$$

and

$$\phi_r = \frac{1}{\pi p_1 + (1-\pi)p_2}\big[\pi p_1\phi_{1r} + (1-\pi)p_2\phi_{2r}\big]$$

where $\psi_{ir}$ and $\phi_{ir}$ denote these functions for
the i-th class of interviewers. Graph II
shows the proportion of errors ($\phi$) made by
the worst $\psi$ percent of the interviewers for
the specified mixtures.
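The mixture case follows the same pattern. The sketch below assumes the weighting just described, in which each class contributes to the error total in proportion to its share of interviewers times its error rate; the function names and parameter values are hypothetical.

```python
from math import comb


def binomial_term(n, p, x):
    """Probability of exactly x errors in n trials at error rate p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def psi(n, p, r):
    """Proportion of a homogeneous class making r errors or more."""
    return sum(binomial_term(n, p, x) for x in range(r, n + 1))


def phi(n, p, r):
    """Proportion of that class's errors made by those with r errors or more."""
    return sum(x * binomial_term(n, p, x) for x in range(r, n + 1)) / (n * p)


def mixture_psi(n, share, p1, p2, r):
    """Proportion of all interviewers making r errors or more."""
    return share * psi(n, p1, r) + (1 - share) * psi(n, p2, r)


def mixture_phi(n, share, p1, p2, r):
    """Proportion of all errors made by interviewers with r errors or more."""
    mean_rate = share * p1 + (1 - share) * p2
    weighted = share * p1 * phi(n, p1, r) + (1 - share) * p2 * phi(n, p2, r)
    return weighted / mean_rate


# Hypothetical mixture: 30 percent of interviewers at rate .05, the rest at .20.
print(round(mixture_psi(10, 0.3, 0.05, 0.20, 3), 4),
      round(mixture_phi(10, 0.3, 0.05, 0.20, 3), 4))
```

When the two error rates are equal, these functions reduce to the single-class expressions, which gives a simple check on the weighting.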

Generalization  to  distributions  other 
than  the  Bernoulli  is  direct. 